DATA CLEANING (HUGO)

years_1 <- c(1900:2012, 2014)
years_2 <- c(2015:2019)
importing_data = function(x){
 
  if(str_detect(x, str_c(years_1, collapse = "|"))) {
  read_csv(x, na = c("NULL", "", "0"), col_types = "cicccciiiicc") 
  } 
  
  else if(str_detect(x, str_c(years_2, collapse = "|"))){
    read_csv(x, na = c("NULL", "", "0"), col_types = "cccicccccccccccccccccciiiiccc")
  }
}
boston_df <- 
  tibble(list.files("data", full.names = TRUE)) %>% 
  setNames("file_name") %>% 
  mutate(data = map(file_name, importing_data)) %>% 
  unnest(data) %>% 
  mutate(year = readr::parse_number(file_name),
         city = coalesce(city, residence),
         display_name = str_replace_all(display_name, "[^a-zA-Z0-9]", " ")) %>% 
  mutate(country_residence = replace(country_residence, country_residence == "AHO", "Netherland Antilles")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ALB", "Albania")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ALG", "Algeria")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "AND", "Andorra")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ARG", "Argentina")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Argenti", "Argentina")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "AUS", "Australia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Austral", "Australia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "AUT", "Austria")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BAH", "Bahamas")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BAR", "Barbados")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Barbado", "Barbados")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BDI", "Burundi")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BLR", "Belarus")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BEL", "Belgium")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BER", "Bermuda")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BRA", "Brazil")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BRN", "Brunei")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "BUL", "Bulgaria")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CAN", "Canada")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CAY", "Cayman")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CHI", "Chile")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CHN", "China")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "COL", "Colombia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Colombi", "Colombia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CRC", "Costa Rica")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Costa R", "Costa Rica")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CRO", "Croatia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CYP", "Cyprus")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "CZE", "Czech Republic")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Czech R", "Czech Republic")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "DEN", "Denmark")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "DOM", "Dominican Republic")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Dominic", "Dominican Republic")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ECU", "Ecuador")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "EGY", "Egypt")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "El Salv", "El Salvador")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ESA", "El Salvador")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ESP", "Spain")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "EST", "Estonia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ETH", "Ethiopia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Ethiopi", "Ethiopia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Faroe I", "Faroe Islands")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "FIN", "Finland")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "FLK", "Falkland Islands")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "FRA", "France")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "GBR", "England")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "GER", "Germany")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "GRE", "Greece")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "GRN", "Greenland")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "GUA", "Guatemala")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Guatema", "Guatemala")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "HKG", "Hong Kong")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Hong Ko", "Hong Kong")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "HON", "Honduras")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Hondura", "Honduras")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "HUN", "Hungary")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "INA", "Indonesia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Indones", "Indonesia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "IND", "India")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "IRL", "Ireland")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ISL", "Iceland")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ISR", "Israel")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ITA", "Italy")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "JAM", "Jamaica")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "JPN", "Japan")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "JOR", "Jordan")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "KEN", "Kenya")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "KOR", "Korea")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Korea,", "Korea")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "KSA", "Saudi Arabia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "KUW", "Kuwait")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "LAT", "Latvia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "LIE", "Liechtenstein")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Liechte", "Liechtenstein")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Lithuan", "Lithuania")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "LTU", "Lithuania")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "LUX", "Luxembourg")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Luxembo", "Luxembourg")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Macao S", "Macao")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Macedon", "Macedonia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Malaysi", "Malaysia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "MAR", "Martinique")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Martini", "Martinique")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "MAS", "Malaysia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "MEX", "Mexico")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "MGL", "Mongolia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "MLT", "Malta")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "NCA", "Nicaragua")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "NED", "Netherlands")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Netherl", "Netherlands")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "New Zea", "New Zealand")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "NGR", "Nigeria")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "NOR", "Norway")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "NZL", "New Zealand")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "OMA", "Oman")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "PAK", "Pakistan")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Palesti", "Palestine")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "PAN", "Panama")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "PAR", "Paraguay")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Paragua", "Paraguay")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "PER", "Peru")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "PHI", "Philippines")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Philipp", "Philippines")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "POL", "Poland")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "POR", "Portugal")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Portuga", "Portugal")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Puerto", "Puerto Rico")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "QAT", "Qatar")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ROU", "Romania")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Saudi A", "Saudi Arabia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "SIN", "Singapore")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Singapo", "Singapore")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "SLO", "Slovenia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Slovaki", "Slovakia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Sloveni", "Slovenia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "SMR", "San Marino")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "South A", "South Africa")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "SRB", "Serbia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Sri Lan", "Sri Lanka")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "SUI", "Switzerland")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "SVK", "Slovakia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "SWE", "Sweden")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Switzer", "Switzerland")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "TCA", "Turks and Caicos")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "THA", "Thailand")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Thailan", "Thailand")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "TPE", "Taipei")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "TRI", "Trinidad")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Trinida", "Trinidad")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "TUR", "Turkey")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "TWN", "Taiwan")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "UAE", "United Arab Emirates")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "UGA", "Uganda")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "UKR", "Ukraine")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "United", "United States")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "URU", "Uruguay")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "USA", "United States")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "VEN", "Venezuela")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Venezue", "Venezuela")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "VGB", "Virgin Islands")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "VIE", "Vietnam")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "ZIM", "Zimbabwe")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "RSA", "South Africa")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "RUS", "Russia")) %>%
  mutate(country_residence = replace(country_residence, country_residence == "Russian", "Russia")) %>%
  filter(!is.na(display_name)) %>% 
  select(-file_name, -residence, -first_name, -last_name)

SUHANI

Geographical Analysis of Winners

Overview


We were interested in creating a map with the locations where winners from the Boston marathon over the past 120 year are from. Additionally, we wanted to examine how the locations of winners changed over time. We also analyzed data from winners from the wheel chair division and compared results to the men and women open divsion.

Data Cleaning


Plots


Map of Male Champions

The continent with the largest nuber of male winners from 1900 and beyond is North American, specifically the US, followed by Africa. There are 44 winners from the United States, 20 from Kenya, 10 from Japan.


Map of Female Champions

This is in contrast to female winners from 1900 and beyond where Africa is the continent with the most winners, closely followed by Europe, and North America. Most of the female winners in Africa are from Kenya. The map for female winners suggests that a participants proximinity to Boston likely does not play a role in their probability of winning the marathon.


Plot of Officials Times of Champions by Year in the Open Division

Over time, the winning time for both male and female winners has declined. While prior to 1950, most of the winners were from the United States, most of the winners are from Kenya. Additionally, we can see that the lowest marathon times have been from people from Kenya.


Map of Champions in the Wheel Chair Division

When assessing wheel chair division map, we can see that 25 winners are from the United States, followed by 19 from Europe, and 10 from Africa. While the male and female open division lack winners from Switzerland, there are 15 winners in the wheel chair division from Switzerland.


Plot of Official Times of Champions by Year in Wheel Chair Division

While there is a downward trend in official times for the open division, there is no trend in official times over time. Many of the winners pre-2000 are from the United States while many of the winners post-2000 are from South Africa. The lowest time recorded is from an individual in Switzerland.


Weather Analysis

Overview


We were interested in analyzing the weather patterns such as temperature, wind, and sky conditions during the Boston marathon over time. We were also curious if there was a relationship between weather and the winning time and participant average times.

Plots


Map of Weather during Boston Marathon

*Note datapoint size is related to wind speed.

Over the past 20 years, the temperature in Boston has generally stayed between 40-70 degrees and is usually a clear day. In 2012, there was a particularly hot marathon day with the tempature in the high 80’s. Over the past few years, there has been relativly low wind speeds on average with exception to 2015.


Wind Analysis during Boston Marathon

In the early 2000’s, the wind direction was generally toward the North/north east and recently has been generally toward the west, northwest direction. There has been relatively low wind speed with little wind speed varability the past few years during the Boston marathon. Around 2010, there are relatively higher wind speeds. In 2007, there was a particularly windy marathon with speeds ranging from 20-30 mph.


Comparison between Chamption Times and Weather

*Note datapoint size is related to wind speed.

The fastest marathon times in the past 20 years have occured when the temperature is around 55 degrees for men and around 60 degrees for woman. At temperatures 55 and below as well as 85 and above, there tends to be higher wind speeds. Although there are only a few datapoints above 70 degrees, one may hypothesize from the graph that as temperature increases, the fastest marathon time also increases.


Comparison between Participants Times and Weather

As temperature increases, the median marathon time among participants also increases also slightly increases. Furthermore, the lowested median marathon times occur around 55 degrees. We also see that there are many outliers in the higher timepoints.

TAMARA

cleaning data for winner over time plots

We are interested in looking at the trends of the winning Boston Marathon times from the 1900’s to 2019. To do this we first needed to convert data and time variables from character to POSIXCT.

From here with the cleaned up variables, we filtered the data set to just show the overall winners.

This plot shows that over time, the Boston marathon winners are getting faster. There also seems to be some data that was recorded incorectly. Jim Knaub and Franz Nietlispach are reported as having an official time of 1:22:17 and 1:25:59 but the fastest marathon ever run is 1:59 (though not recorded officially, https://www.nytimes.com/2019/10/12/sports/eliud-kipchoge-marathon-record.html)

The fastest time ever run (excluding the sub two hour times) for the Boston Marathon was run by Martin E Duffy in 1975, with a time of 2:04:54.

How do the Boston Marathon winners compare to the record marathon time for the same year?

To compare the Boston Marathon with the record marathon times, we found a table from www.topendsports.com where they recorded the fastest marathon time for that year starting from 1908 to 2018 (though not every year is included).

The plot shows the dramatic decrease in fastest recorded marathon times over the years. In 1908 the fastest recorded time was run by Johnny Hayes in the London Marathon in 2:55:18, compared to the current record holder, Eliud Kipchoge, running the 2:01:39 in the Berlin Marathon in 2018.

CLAIRE

Age distribution overall (counts)

This is plotly:

This is plain ggplot:

This is plotly from ggplot:

Histogram of percentages by age

plotly:

Density plot ages

SYDNEY